ABSTRACT



A method and apparatus for extracting structure information 
from an unstructured electronic document is described. The 
method includes the step of identifying a structural type for 
each instance in the electronic document by examining
presentation attributes associated with each instance.
Examples of presentation attributes which are examined
include numbering formats, indentations, and font sizes and
weights.

STRUCTURE EXTRACTION ON
ELECTRONIC DOCUMENTS 

BACKGROUND OF THE INVENTION 

The present invention relates to techniques that identify 
and categorize paragraphs, subparagraphs, and structural 
groupings in electronic documents, and more particularly to 
techniques that build a structure hierarchy from structural 
groupings. 

An electronic document typically has information 
content, such as text, graphics, and tables, and formatting 
information that directs how the content is to be displayed. 
An electronic document resides on a digital, though not 
necessarily electronic, computer storage medium. An elec- 
tronic document is generally provided by an author, 
distributor, or publisher who desires that the document be 
viewed with the appearance with which it was created. 
Electronic documents may be widely distributed and, 
therefore, can be viewed on a great variety of hardware and 
software platforms. A hypertext document is an electronic 
document with links, which are explicit, user-selectable
navigation elements. 

Generally, electronic and human perceptible documents 
include a set of paragraphs. Each instance of a paragraph 
shares characteristics with other paragraphs. Paragraphs that 
share visual characteristics can be considered the same 
structural type. Examples of structural paragraph types are 
titles, headers, and footnotes. 

In addition, in all documents, paragraphs can have 
subparagraphs, which are character streams. Each instance 
of a subparagraph shares similar characteristics with other 
subparagraphs that are the same structural type. Examples of 
subparagraph structural types are book titles, quotations, and 
foreign words and phrases.

A document typically has a logical organization. Within 
the logical organization are identifiable structural groups. A
series of chapters containing paragraphs is an example of a 
structural group, as is a section that contains a heading, 
several paragraphs, and a bulleted list. 

Organizing components in an electronic document by 
structural type permits an electronic document development 
system to perform global operations on all instances of the 
same type within the electronic document. For example, the 
FrameMaker® document publishing system, available from
Adobe Systems Incorporated of San Jose, Calif., can globally
change the justification of all paragraphs tagged as a
particular type in the electronic document and can globally
change the font size of all characters tagged as a particular
type in the electronic document.

Standard type formats exist for particular uses and for 
particular systems. For example, the HyperText Markup 
Language (HTML) uses the embedded tags <P> and </P> to 
delimit paragraphs, and <B> and </B> to delimit bold text. 
HTML also specifies many other tags including tags for
titles, menus, definitions, quoted blocks, and heading styles. 
For an electronic document to have the desired visual 
appearance when viewed with a World Wide Web browser, 
the electronic document must have the appropriate HTML 
tags. 

When viewed on paper or on a computer display, the 
different structural paragraph types in a document, such as 
headings and lists, are readily identifiable. However,
enabling a system to perform operations based on structural
types, such as modifying, rearranging, displaying, or
printing a document, generally requires that someone examine
and tag all paragraphs and subparagraphs manually according
to their visually recognized structural type. This is tedious
and time consuming, and often impracticable for large
documents.

SUMMARY OF THE INVENTION

In accordance with the present invention, a method of
extracting structure information from an electronic document
includes the step of identifying a structure type for each
instance in the electronic document by examining
presentation attributes associated with each instance. With
such an arrangement, an unstructured electronic document
can be provided with structural tags.

Among the advantages of the invention are one or more of the
following. The invention enables an electronic document
development system to perform global operations on all
paragraph and subparagraph instances by structural type.
Global operations include, but are not limited to, format
changes, searches, word and phrase replacements, and
extractions. The invention enables the electronic document
development system to perform operations on the structure
of the electronic document (e.g., rearrange the hierarchy or
subdivide the structure). The invention permits an electronic
document to be rearranged or divided according to structural
groupings of the document. The invention enables an
electronic document to be rearranged based on full sections
and identifiable units. The invention enables the document to
be split based on logical organization.

Other features and advantages of the invention will become
apparent from the following description and from the claims.

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a schematic diagram of a computer platform
suitable for supporting a structural extractor system in
accordance with the present invention;

FIG. 2 is a flow chart of a method for converting and 
dividing an electronic document for use by another docu- 
mentation system; 

FIG. 2a is an example of a table that maps source to
destination tags; 

FIG. 3 is a flow chart of a method that gathers statistical 
information about an electronic document for use in the 
method of FIG. 2; 

FIG. 4 is a flow chart of a method for mapping presen- 
tational attributes in a document to structural types useful for 
the method of FIG. 2; 

FIG. 5 is a flow chart of a method for determining whether 
a paragraph is a heading; 

FIG. 6 is a flow chart of a method that sorts heading types; 

FIG. 7 is a flow chart of a method that sorts headings 
according to name; 

FIG. 7a is a table that contains heading levels;

FIG. 8 is a diagram that shows a hierarchy of instances in
a document;

FIG. 9 is a flow chart of a method that builds a
hierarchical tree structure for an electronic document;

FIG. 9a is a table that maps a link in a source electronic
document to a destination electronic document; and

FIG. 10 is a flowchart that shows a method that divides a
tree structure into subdocuments.

DESCRIPTION OF THE PREFERRED
EMBODIMENTS

Referring to FIG. 1, a general purpose computer platform
100 suitable for supporting an electronic document
development system and including a structural extractor 102 is
shown. The platform (e.g., a personal computer or
workstation) includes a digital computer 104, a display 106,
a keyboard 108, a mouse 110 or other pointing device, and
a mass storage device 112 (e.g., hard disk drive,
magneto-optical disk drive, or floppy disk drive). The computer 104
is of conventional construction and includes memory 120, a
processor 122, and other customary components (e.g.,
memory bus and peripheral bus).

An electronic document 130 stores information on a hard
disk or other computer readable medium such as a diskette.
An electronic document is viewable in a human perceptible
representation on a computer display 132 or as a hardcopy
printout through operation of a computer program.

A system capable of converting presentational attribute
information in a source electronic document to a
pre-determined set of structural types in a destination electronic
document would eliminate human intervention in the
conversion process. Furthermore, such a system is useful for
structuring a document as a structural tree hierarchy. Such a
system would need to identify and map standard and
user-defined types in the source electronic document to
appropriate structural types in the destination electronic
document.

Other processes that typically require human intervention
are rearranging parts of an electronic document and
dividing an electronic document into electronic
sub-documents. A system capable of identifying logical breaks,
for example, at the beginning of a chapter or a section, can
automate the rearrangement and division processes. Such
automated processes must maintain links (e.g., hypertext
links) to other components in the electronic document.

Referring to FIG. 2, a system 200 that identifies
formatting styles or types receives as input a source electronic
document and outputs one or more electronic documents
with structural types recognizable by a destination system.
The destination system may be the same or a different
system than the source system. The system examines an
electronic document and collects statistics about the
paragraph instances in the electronic document (step 202). Using
this information, if the source electronic document has
original presentational attribute information in the form of a
named type, the system creates a tag table, having at least
two columns as shown in FIG. 2a, mapping each original
paragraph type in the source electronic document to the
structural type for the destination system (step 204). The tag
table can contain information that indicates if a paragraph
having that type can separate from the electronic document.

Structural types serve as a basis for building a tree
structure (step 206) that represents the structural
organization of instances in the electronic document. The system can
optionally divide the electronic document into
subdocuments (step 208). Smaller files are easier to download and
view using a World Wide Web browser, for example. The
system also can create output files (step 210) with structural
type tags that the destination system will recognize.

Referring to FIG. 3, the system gathers statistics on each
paragraph. The system can gather statistics while reading the
source electronic document during one or more passes.
Statistics include the number of instances of each original
paragraph type (step 310), the total number of lines for all
instances having the same paragraph type (step 320), and the
number of groups of consecutive instances of a particular
type (step 330). Each group of consecutive instances of the
original type is referred to as a batch. From these statistics,
the system computes the average number of lines for all the
instances for each original paragraph type (step 340). The
system also computes the average batch-length for an
original paragraph type by dividing the number of instances of
the same type by the number of batches of that type (step
350). These statistics are used in conjunction with the
method described in reference to FIG. 5.

A checklist containing factors is used to map paragraph
and subparagraph types in a source electronic document to
structural types in a destination electronic document. One
factor is the combination of existing style settings, such as
relative character weights, relative font sizes, italic
characters, and indentations. Other factors include the
placement of paragraphs and subparagraphs in relationship to
other paragraphs and subparagraphs, numbering schemes,
and repetitive structures (e.g., bulleted lists).

The system maps each source type to a destination type.
A mapping between types can occur although the source
type does not have all characteristics of a destination type.
However, the more characteristics present, the higher the
probability that the source type is a specific destination type.

Referring to FIG. 4, the steps used to map paragraph
presentational attribute types in the source system to
structural paragraph types in the destination system are shown.
The system gets the next original paragraph type from a
catalog that defines all original paragraph types for the
electronic document (step 402). The system determines
whether the paragraph type is a heading (step 404) or a list
element (step 408). If the paragraph is neither, the system
identifies it as a default paragraph type (step 410). Before
mapping a type to a default paragraph (step 410), the system
can perform additional tests to determine whether the
paragraph is a footnote, bibliographic element, quoted passage,
and so forth.

To determine if the paragraph structural type is a list, the
system checks whether the paragraph has an automatically
generated prefix, which indicates that the paragraph is an
ordered list. In a FrameMaker source document, for
example, if a format contains the characters "<" and ">", the
system tags the paragraph as an ordered list (step 416).
These characters enclose a code that specifies a quantity,
such as a number, that varies for each instance. Otherwise,
the system identifies the structural paragraph type as an
unordered list (step 414).

Referring to FIG. 5, the system considers a number of
factors to determine if the structural paragraph type is a
heading. The system checks the placement of the paragraph
(step 510) and if the placement is on the side of a page within
an area predominated by white space, the original paragraph
type maps to a heading (step 590). If the name of the
paragraph type begins with the letter "H" or "h", and ends
with a number (step 520), the paragraph type maps to a
heading (step 590). If the name of the paragraph type is
"Title" (step 530), the paragraph type maps to a heading
(step 590). If the paragraph type has at least one instance and
there is at least one batch (step 540), the system uses the
statistics gathered to create a weighting factor. This factor is
the inverse of the average batch-length multiplied by the
average number of lines (step 550). If the paragraph type is
automatically numbered, this weighting factor is multiplied
by the empirical constant 1.5 (step 560). If the paragraph
type is straddled (i.e., spans across multiple columns), the
weighting factor is multiplied by the empirical constant 1.5
(step 570). If the paragraph type is automatically numbered
and is straddled, the weighting factor increases twice. The
system compares the weighting factor to the constant 0.9
(step 580) and if the weighting factor is greater, the
paragraph type is classified as a heading (step 590). As the
system assigns headings, it builds a heading table listing
each heading type, as shown in FIG. 7a.
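
By way of illustration only, the statistics of FIG. 3 and the
weighting test of FIG. 5 might be sketched in Python as follows.
The function names and the representation of a paragraph as a
(type name, line count) pair are assumptions made for this
example. The sketch also assumes that "the inverse of the average
batch-length multiplied by the average number of lines" means
1 / (average batch-length x average number of lines); only the
constants 1.5 and 0.9 are taken from the description.

```python
from collections import defaultdict
from itertools import groupby


def gather_stats(paragraphs):
    """Collect per-type statistics from (type_name, line_count) pairs
    given in document order."""
    stats = defaultdict(lambda: {"instances": 0, "lines": 0, "batches": 0})
    for type_name, line_count in paragraphs:
        stats[type_name]["instances"] += 1       # step 310
        stats[type_name]["lines"] += line_count  # step 320
    # A batch is a run of consecutive instances of one type (step 330).
    for type_name, _run in groupby(paragraphs, key=lambda p: p[0]):
        stats[type_name]["batches"] += 1
    for s in stats.values():
        s["avg_lines"] = s["lines"] / s["instances"]           # step 340
        s["avg_batch_length"] = s["instances"] / s["batches"]  # step 350
    return dict(stats)


def heading_weight(s, auto_numbered=False, straddled=False):
    # Assumed reading: inverse of (average batch length x average lines).
    weight = 1.0 / (s["avg_batch_length"] * s["avg_lines"])  # step 550
    if auto_numbered:
        weight *= 1.5  # step 560
    if straddled:
        weight *= 1.5  # step 570
    return weight


def looks_like_heading(s, **flags):
    return heading_weight(s, **flags) > 0.9  # step 580


# Hypothetical document: two one-line "Head" paragraphs among "Body" runs.
doc = [("Head", 1), ("Body", 4), ("Body", 3), ("Body", 5),
       ("Head", 1), ("Body", 2)]
stats = gather_stats(doc)
for name, s in stats.items():
    print(name, round(heading_weight(s), 3), looks_like_heading(s))
```

Under this reading, a type whose instances are short and occur
singly scores near 1, while a type that occurs in long runs of
multi-line paragraphs scores well below the 0.9 threshold.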

After classifying all original paragraph types defined in 
the paragraph catalog, the system considers all original 
subparagraph types defined in the character catalog. In one 
embodiment, all character types are mapped to a default 
character tag recognized by the destination system. 
However, other embodiments may consider factors such as 
bounding entities (e.g., quotation marks, underlines, and 
parentheses), bold text, italics, and highlighting techniques. 

For paragraphs and subparagraphs that do not have tags, 
the system can analyze untagged paragraphs and 
subparagraphs, and assign appropriate tags. For example, a 
series of one-line numbered paragraphs may be tagged as an 
ordered list, or a quoted series of characters may be tagged 
as a quoted passage. 
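
As a rough sketch of how untagged paragraphs might be analyzed,
the following Python fragment tags runs of one-line numbered
paragraphs as an ordered list. The regular expression, the
minimum run length of two, and the tag names are assumptions
chosen for the example; the description does not specify them.

```python
import re

# Assumed pattern for a numeric prefix such as "1." or "2)".
NUMBERED = re.compile(r"^\s*\d+[.)]\s+")


def tag_untagged(paragraphs, min_run=2):
    """Return one tag per paragraph: 'ordered-list' or 'default'."""
    tags = ["default"] * len(paragraphs)
    i = 0
    while i < len(paragraphs):
        j = i
        # Extend the run while paragraphs are single-line and numbered.
        while (j < len(paragraphs)
               and "\n" not in paragraphs[j]
               and NUMBERED.match(paragraphs[j])):
            j += 1
        if j - i >= min_run:
            tags[i:j] = ["ordered-list"] * (j - i)
        i = max(j, i + 1)
    return tags


sample = ["Intro text", "1. First", "2. Second", "3. Third",
          "Closing text", "4. Lone item"]
print(list(zip(sample, tag_untagged(sample))))
```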

Prior to building a tree structure (step 206) that represents 
the hierarchical organization of instances in the document, 
the system assigns a level to each structural paragraph type. 
An ordinary paragraph is assigned a level of 0. Heading and 
list types are assigned levels using a sorting technique. The 
system may use any sorting technique, for example, a bubble 
sort, quick sort, or insertion sort, using the comparison 
technique as shown in FIG. 6. The comparison technique
selects two items at a time from a heading table, as shown
in FIG. 7a, compares the two items, and assigns each a level.
The sorting technique makes additional passes until all items
are ordered and assigned the appropriate level.

As shown in FIG. 6, the technique that compares headings 
checks several attributes for two heading types, A and B. The 
system gets two tags from the heading mapping table (step
602), and first checks the names of the heading tags (step 
606). 

Referring to FIG. 7, exemplary steps for comparing 
names are shown. If the heading names are similar, the 
headings may be different levels. The system checks 
whether name A and name B have the same number of 
characters and end with a number (step 702). If all characters 
except the last are identical (step 704), the last character in 
name A and the last character in name B are compared (step 
706). The heading with the greater number is deemed the 
lesser heading (step 708 and step 710). The system enters the 
level into the heading table, as shown in FIG. 7a. The lower
the heading number, the closer the level is to the root in the
tree structure. For example, the system assigns a lower level
to a heading named Heading2 than a heading named
Heading1, and Heading1 is closer to the root node.
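
The name comparison of FIG. 7 might be sketched in Python as
shown here. The function name and the "A" / "B" / None return
convention are assumptions for illustration; the tests follow
steps 702 through 710 as described.

```python
DIGITS = set("0123456789")


def compare_heading_names(name_a, name_b):
    """Return 'A' or 'B' for the more major heading, or None when the
    names alone do not decide the order."""
    # Step 702: same number of characters, both ending with a number.
    if len(name_a) != len(name_b):
        return None
    if name_a[-1:] not in DIGITS or name_b[-1:] not in DIGITS:
        return None
    # Step 704: all characters except the last must be identical.
    if name_a[:-1] != name_b[:-1]:
        return None
    # Step 706: compare the last characters.
    a, b = int(name_a[-1]), int(name_b[-1])
    if a == b:
        return None
    # Steps 708 and 710: the greater number is the lesser heading.
    return "A" if a < b else "B"


print(compare_heading_names("Heading1", "Heading2"))
```

When the function returns None, the comparison would fall through
to the other presentational attributes checked in FIG. 6.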

Examples of other presentational attributes that the sys- 
tem checks, as shown in FIG. 6, are whether a heading 
straddles columns (step 608), the font sizes (step 610) and 
font weights (step 612) if the font is the same family, 
whether a paragraph adjoins (i.e., runs into) the following 
paragraph (step 614), the indentations (step 616), and the 
font sizes and weights from different font families (step 
618). The system checks attributes with greater weights first. 
The result of the comparison is that the A heading is more 
major (step 620) or the B heading is more major (step 622). 
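
A comparison cascade of this kind could be sketched in Python as
follows. The `HeadingType` fields, the default values, and above
all the direction of each test (for example, that a straddling or
larger heading is the more major one, and that a run-in or more
deeply indented heading is the less major one) are assumptions
made for the example; the description names the attributes and
their order but not the outcome of each test. The name check of
step 606 is omitted for brevity.

```python
from dataclasses import dataclass
from functools import cmp_to_key


@dataclass
class HeadingType:
    name: str
    straddles: bool = False
    family: str = "Times"
    size: float = 12.0
    weight: int = 400
    run_in: bool = False
    indent: float = 0.0


def compare(a, b):
    """Negative if a is more major, positive if b is, 0 if undecided."""
    if a.straddles != b.straddles:              # step 608
        return -1 if a.straddles else 1
    if a.family == b.family:
        if a.size != b.size:                    # step 610
            return -1 if a.size > b.size else 1
        if a.weight != b.weight:                # step 612
            return -1 if a.weight > b.weight else 1
    if a.run_in != b.run_in:                    # step 614
        return 1 if a.run_in else -1
    if a.indent != b.indent:                    # step 616
        return -1 if a.indent < b.indent else 1
    if a.size != b.size:                        # step 618
        return -1 if a.size > b.size else 1
    if a.weight != b.weight:
        return -1 if a.weight > b.weight else 1
    return 0


def assign_levels(types):
    # Level 0 is reserved for ordinary paragraphs, so headings start at 1.
    ordered = sorted(types, key=cmp_to_key(compare))
    return {t.name: level for level, t in enumerate(ordered, start=1)}


types = [HeadingType("Sub", weight=700),
         HeadingType("Chapter", straddles=True, size=18.0),
         HeadingType("Section", size=14.0)]
print(assign_levels(types))
```

In this sketch, types that compare as equal still receive distinct
consecutive levels, which is a simplification.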

The system assigns levels to list types in a similar manner 
as it assigns levels to heading types. It uses a sorting 
technique that compares names, numbering formats, 
indentations, and font sizes and weights. The system can 
include additional comparisons during the sort for other 
paragraph characteristics. 

Referring to FIG. 8, a hierarchical structure represents
structural groups of paragraph instances in a document,
where each instance is represented by a node in the
hierarchy. The system arranges the nodes according to the
logical flow of the document. For example, a document may
have four segments organized as chapters 802a-c and an
appendix 802d, two of which have sections 804a-d. Each
section may have one or more paragraphs 806a-i.

As shown in FIG. 9, the system begins to build a structured
representation of a document by reading an unstructured
electronic document one paragraph at a time and getting the
tag for the paragraph (step 902). Each tag has a tag level
that was assigned during the sorting phase, and entered into
a table such as shown in FIG. 7a. If the tag level is zero
(step 904), the paragraph is an ordinary paragraph. In this
case, the system appends the paragraph at the current level
(step 906). The system handles the paragraph contents (step
908) as terminal nodes. The contents include characters and
links. A link causes the system to add link information to a
link destination table, as shown in FIG. 9a, which is a table
created during this tree-building phase. The link destination
table includes a source electronic document identifier, an
internal location of the link in the source and destination
electronic documents, and a pointer to a node in the tree
structure.

If the tag level is not zero, the paragraph creates a section
node, which represents a branch such as a heading or
beginning of a list. If the tag level is less than or equal to the
current level, the system walks up the tree (step 914) until
the tag level is greater than the current level. The system
generates a section node at this level (step 916). The section
node begins a new branch of the tree. The node is identified
as a section node and the contents of the paragraph are the
first children of the node (e.g., the heading text).
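
The tree-building steps of FIG. 9 might be sketched in Python as
follows. The `Node` class, the function names, and the
representation of the input as (tag, text) pairs with a separate
tag-to-level table are assumptions for illustration; link handling
is left out.

```python
class Node:
    def __init__(self, kind, level=0, text="", parent=None):
        self.kind, self.level, self.text = kind, level, text
        self.parent, self.children = parent, []
        if parent is not None:
            parent.children.append(self)


def build_tree(paragraphs, levels):
    """paragraphs: (tag, text) pairs in document order.
    levels: mapping from tag to level, 0 for ordinary paragraphs."""
    root = Node("root")
    current = root
    for tag, text in paragraphs:                         # step 902
        level = levels.get(tag, 0)
        if level == 0:                                   # step 904
            Node("paragraph", 0, text, parent=current)   # steps 906, 908
            continue
        # Walk up until the tag level exceeds the current level (step 914).
        while current is not root and level <= current.level:
            current = current.parent
        current = Node("section", level, parent=current)  # step 916
        # The heading text becomes the first child of the section node.
        Node("paragraph", 0, text, parent=current)
    return root


def outline(node, depth=0):
    """List the heading text of every section, indented by depth."""
    lines = []
    for child in node.children:
        if child.kind == "section":
            lines.append("  " * depth + child.children[0].text)
            lines.extend(outline(child, depth + 1))
    return lines


levels = {"H1": 1, "H2": 2, "Body": 0}
doc = [("H1", "Chapter 1"), ("Body", "p"), ("H2", "Section 1.1"),
       ("Body", "q"), ("H2", "Section 1.2"), ("H1", "Chapter 2"),
       ("Body", "r")]
print("\n".join(outline(build_tree(doc, levels))))
```
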
A hierarchical organization enables the system to rearrange
sections and divide an electronic document into sub-files at
specified branches in the tree structure. Furthermore,
a hierarchical organization provides a means for identifying
structural groups as branches of a tree. Branches may
represent lists, chapters, sections, subsections, and
footnotes. A document development system can also display the
structural organization of an electronic document and allow
users to specify portions of the electronic document as
targets for specific operations. Such operations may include
format changes, searches, word and phrase replacements,
and extractions on all instances in one or more structural
groups.

The system creates destination electronic documents by
writing the paragraph instances to files according to the tree
structure. The system walks the tree structure and writes the
contents of each node, along with the appropriate tags, to the
file associated with the node.

Section nodes represent paragraph instances where the
system can divide or rearrange the electronic document.
Using the heading table, as shown in FIG. 7a, specific nodes
are identified as nodes where the system can split the tree.
In one embodiment, a tree structure may be subdivided at
every branch, as shown in FIG. 10. To divide a structured
electronic document in this way, the system traverses the
tree, node by node (step 1010). The system checks for
section nodes (step 1012). If the node is a section node, the
system creates a new electronic sub-document (step 1016)
and generates a destination identifier (step 1018) that is
entered in the link destination table, as shown in FIG. 9a.
The electronic sub-documents, except the first, have links
from the parent electronic sub-document (step 1020). The
link creates a natural flow from one electronic sub-document
to the next; for example, the link may be used as a hypertext
link. For each node, the system labels the node with a
destination identifier (step 1022).
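
Dividing a tree at every section node, as in FIG. 10, might be
sketched in Python as follows. The dictionary-based nodes, the
helper names, the "doc0", "doc1" identifier scheme, and the
bracketed link placeholder are all assumptions made for this
example.

```python
def para(text):
    return {"kind": "paragraph", "text": text, "children": []}


def section(*children):
    return {"kind": "section", "children": list(children)}


def split_tree(root):
    """Return (subdocs, dest_of): the contents of each sub-document and
    the destination identifier assigned to each node, keyed by id(node)."""
    subdocs = {"doc0": []}  # the first sub-document has no parent link
    dest_of = {}
    counter = 0

    def visit(node, current_id):
        nonlocal counter
        if node["kind"] == "section":                # step 1012
            counter += 1
            new_id = f"doc{counter}"                 # steps 1016, 1018
            subdocs[new_id] = []
            # Link from the parent sub-document (step 1020).
            subdocs[current_id].append(f"[link to {new_id}]")
            current_id = new_id
        elif node["kind"] == "paragraph":
            subdocs[current_id].append(node["text"])
        dest_of[id(node)] = current_id               # step 1022
        for child in node["children"]:
            visit(child, current_id)

    visit(root, "doc0")                              # step 1010
    return subdocs, dest_of


sec_b = section(para("Section 1.1"), para("body2"))
root = {"kind": "root", "children": [
    para("Preface"),
    section(para("Chapter 1"), para("body1"), sec_b),
    section(para("Chapter 2")),
]}
subdocs, dest_of = split_tree(root)
for doc_id, contents in subdocs.items():
    print(doc_id, contents)
```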

The system provides several ways for dividing an electronic
document. The system can divide the electronic document at
pre-defined levels. Levels can be designated by a user
manually. Levels can be determined by pre-defined or
automatically determined size limitations.

A number of user interface techniques can be used to 
divide or rearrange a document. The system can display the 
tree structure in a user interface. A user can use a pointing
device, such as a mouse, to specify areas in the tree to move 
or specify where the system should divide the electronic 
document. 

When a file is rearranged or divided, the system maintains 
all internal and external document links. The link destination
table, as shown in FIG. 9a, that the system created during the 
tree-building phase makes this possible. When the system 
encounters a link node, the system resolves the link by 
finding the row in the link destination table containing the 
entry for the source electronic document, the link location, 
and the destination node having the destination identifier. 
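
Link resolution through the link destination table could be
sketched in Python as follows. Keying the table by (source
document, link location) and rendering the resolved link as
"identifier#location" are assumptions for this example; the
description states only which entries a row contains.

```python
def build_link_table(rows):
    """rows: iterable of (source_doc, link_location, dest_id) tuples."""
    return {(src, loc): dest for src, loc, dest in rows}


def resolve_link(table, source_doc, link_location):
    """Return a reference into the sub-document holding the link target,
    or None if the table has no matching row."""
    dest_id = table.get((source_doc, link_location))
    if dest_id is None:
        return None
    return f"{dest_id}#{link_location}"


# Hypothetical rows recorded during the tree-building phase.
table = build_link_table([
    ("manual.fm", "sec-1-1", "doc2"),
    ("manual.fm", "ch-2", "doc3"),
])
print(resolve_link(table, "manual.fm", "sec-1-1"))
```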

Other embodiments are within the scope of the following 
claims. For example, the order of performing steps of the 
invention may be changed by those skilled in the art and still 
achieve desirable results. Weighting factors may be 
changed. Additional steps and additional factors may be 
added. Steps and factors may also be omitted. Name check- 
ing can be extended to include headings with foreign names. 

What is claimed is: 

1. A computer-implemented method for inferring struc- 
ture information in an electronic document, comprising: 

identifying a plurality of paragraph types in a source 
electronic document; 

gathering statistics for the paragraph types in the source 
electronic document, wherein the statistics are based on 
a count of paragraph instances having the same one 
paragraph type assigned, a count of lines of paragraph 
instances having the same one paragraph type, and a 
count of batch length of one paragraph type; 

mapping each paragraph type to one of a plurality of 
structural types; and 

using the statistics for the paragraph types to determine 
the structural type for one paragraph type. 

2. The method of claim 1, wherein mapping one para- 
graph type to one structural type comprises comparing a 
name of the one paragraph type to a word connoting one of 
the plurality of structural types. 

3. The method of claim 1, further comprising:

identifying a plurality of subparagraph types assigned to
a plurality of characters in the source electronic
document; and

mapping each subparagraph type to a pre-defined
character type by examining a plurality of presentational
attributes of the subparagraph type.

4. A computer-implemented method for inferring struc- 
ture information in an electronic document, comprising: 

identifying a plurality of paragraph types in a source
electronic document;

mapping each paragraph type to one of a plurality of
structural types, this mapping comprising:

examining a paragraph placement for a first one of the
paragraph types;

comparing a count of paragraph instances to which the
first one of the paragraph types is assigned to 0 and
a count of batches for the first one of the paragraph
types to 0 if the paragraph placement is not a side
placement;

examining a first character and a last character of a
name of the first one of the paragraph types if the
paragraph placement is not a side placement and the
count of paragraph instances and the count of
batches is 0; and

comparing the name of the first one of the paragraph 
types to a word connoting a first one of the structural 
types if the first character and the last character of the 
name do not connote first one of the structural types. 

5. The method of claim 4, wherein mapping one para- 
graph type to one structural type further comprises: 

computing a weighting factor if the count of paragraph 
instances and the count of batches are greater than 0; 
and 

determining the probability that the name connotes the 
one structural type by comparing the weighting factor 
to a predetermined value. 

6. The method of claim 5, wherein computing the weight- 
ing factor comprises: 

setting the weighting factor to an inverse of an average
batch length multiplied by an average number of lines;

multiplying the weighting factor by 1.5 if the first one of
the paragraph types is automatically numbered;

multiplying the weighting factor by 1.5 if the first one of
the paragraph types straddles multiple columns;

comparing the weighting factor to 0.9; and

mapping the first one of the paragraph types to a heading
if the weighting factor exceeds 0.9.

7. A computer-implemented method for constructing a 
hierarchical organization of paragraph instances from an 
unstructured electronic document, comprising: 

assigning one of a plurality of hierarchical levels to one of
a plurality of structural types in an unstructured
electronic document, wherein this assigning comprises
sorting the structural types that are a heading structural
type by structural name, assigned font, and indentation
specification;

associating one of a plurality of paragraph instances in the 
unstructured electronic document with one of the plu- 
rality of structural types; and 

constructing a hierarchical organization of paragraph 
instances using the structural type with which each 
paragraph instance is associated and the hierarchical 
level assigned to the structural type.

8. The method of claim 7 wherein sorting by structural 
name comprises: 

comparing a length of a first structural name to a length 
of a second structural name; 

comparing a last character in the first structural name and 
a last character in the second structural name if the first 
structural name and the second structural name are the 
same length, end with a number, and have identical 
characters except the last character; and 

designating a more major heading to the one of the first 
structural name and the second structural name having
a greater last character. 

9. A computer-implemented method for constructing a 
hierarchical organization of paragraph instances from an 
unstructured electronic document, comprising: 

assigning one of a plurality of hierarchical levels to one of 
a plurality of structural types in an unstructured elec- 
tronic document, wherein this assigning comprises 
sorting each structural type that is a list structural type 
by structural name and indentation specifications; 

associating one of a plurality of paragraph instances in the 
unstructured electronic document with one of the
plurality of structural types; and

constructing a hierarchical organization of paragraph
instances using the structural type with which each 
paragraph instance is associated and the hierarchical 
level assigned to the structural type. 

10. A computer-implemented method for constructing a 
hierarchical organization of paragraph instances from an 
unstructured electronic document, comprising: 

assigning one of a plurality of hierarchical levels to one of
a plurality of structural types in an unstructured
electronic document;

associating one of a plurality of paragraph instances in the 
unstructured electronic document with one of the plu- 
rality of structural types; and 

constructing a hierarchical organization of paragraph 
instances using the structural type with which each 
paragraph instance is associated and the hierarchical 
level assigned to the structural type wherein this con- 
structing comprises: 

appending a paragraph instance to a current tier if the 
hierarchical level of the structural type with which 
the paragraph instance is associated is equal to 0; 

assigning the current tier to a parent level having a 
lesser tier value if the hierarchical level of the 
structural type with which the paragraph instance is 
associated is not 0 and is less than or equal to the 
current tier, until the hierarchical level of the struc- 
tural type with which the paragraph instance is 
associated is greater than the current tier; and 

generating a section node to begin a new branch of the 
hierarchical organization if the hierarchical level of 
the structural type with which the paragraph instance 
is associated is not 0 and is greater than the current 
tier. 

11. A computer program for constructing a hierarchical
organization of paragraph instances from an unstructured 
electronic document, comprising instructions operable to 
cause a computer to: 

associate one of a plurality of paragraph instances in an
unstructured electronic document with one of the
plurality of structural types;

assign one of a plurality of hierarchical levels to one of a
plurality of structural types; and

construct a hierarchical organization of paragraph
instances using the structural type with which each
paragraph instance is associated and the hierarchical
level assigned to the structural type, the instructions to
construct a hierarchical organization of paragraph
instances comprising instructions to:

append a paragraph instance to a current tier if the
hierarchical level of the structural type with which
the paragraph instance is associated is equal to 0;

assign the current tier to a parent level having a lesser
tier value if the hierarchical level of the structural
type with which the paragraph instance is associated
is not 0 and is less than or equal to the current tier,
until the hierarchical level of the structural type with
which the paragraph instance is associated is greater
than the current tier; and

generate a section node to begin a new branch of the
hierarchical organization if the hierarchical level of
the structural type with which the paragraph instance
is associated is not 0 and is greater than the current
tier.

12. A computer program for constructing a hierarchical 
organization of paragraph instances from an unstructured 
electronic document, comprising instructions operable to 
cause a computer to: 

associate one of a plurality of paragraph instances in an 
unstructured electronic document with one of the plu- 
rality of structural types; and 

assign one of a plurality of hierarchical levels to one of a 
plurality of structure types, the instructions to assign 
comprising instructions operable to cause a computer to 
sort the structural types that are a heading structural 
type by structural name, assigned font, and indentation 
specification. 

13. The product of claim 12, wherein instructions to sort 
by structural name comprise instructions to: 

compare a length of a first structural name to a length of 
a second structural name; 

compare a last character in the first structural name and a 
last character in the second structural name if the first 
structural name and the second structural name are the 
same length, end with a number, and have identical 
characters except the last character; and 

designate a more major heading to the one of the first 
structural name and the second structural name having 
a greater last character. 
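
The name comparison recited in claim 13 can be sketched as follows. This is an illustration under stated assumptions only: the function name `more_major_heading` is hypothetical, and because the claim does not say what happens when the names differ in length or fail the other conditions, the sketch returns `None` in those cases rather than guessing.

```python
from typing import Optional

DIGITS = "0123456789"


def more_major_heading(first: str, second: str) -> Optional[str]:
    """Return the name designated as the more major heading, or None.

    Follows the claim wording literally: the name with the greater last
    character is designated more major. Cases the claim does not cover
    yield None (an assumption of this sketch).
    """
    # Step 1: compare the lengths of the two structural names.
    if len(first) != len(second) or not first:
        return None
    # Step 2: both names must end with a number ...
    if first[-1] not in DIGITS or second[-1] not in DIGITS:
        return None
    # ... and be identical except for the last character.
    if first[:-1] != second[:-1] or first[-1] == second[-1]:
        return None
    # Step 3: designate the name having the greater last character.
    return first if first[-1] > second[-1] else second


if __name__ == "__main__":
    print(more_major_heading("Heading1", "Heading2"))
```

A complete sort would also have to consult the assigned font and the indentation specification recited in claim 12, which this sketch leaves out.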

14. A computer program for constructing a hierarchical 
organization of paragraph instances from an unstructured 
electronic document, comprising instructions operable to 
cause a computer to: 

associate one of a plurality of paragraph instances in an 
unstructured electronic document with one of the plu- 
rality of structural types; and 

assign one of a plurality of hierarchical levels to one of a 
plurality of structural types, the instructions to assign 
comprising instructions operable to cause a computer to 
sort each structural type that is a list structural type by 
structural name and indentation specification. 

15. A computer program for inferring structure information 
in an electronic document, comprising instructions 
operable to cause a computer to: 

identify a plurality of paragraph types in a source elec- 
tronic document; 

gather statistics for the paragraph types in the source 
electronic document, wherein the statistics are based on 
a count of paragraph instances having the same one 
paragraph type assigned, a count of lines of paragraph 
instances having the same one paragraph type, and a 
count of batch length of one paragraph type; 

map each paragraph type to one of a plurality of structural 
types; and 

use the statistics for the paragraph types to determine the 
structural type for one paragraph type. 
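
The statistics recited in claim 15 can be gathered in one pass over the document. The sketch below is illustrative only. It assumes that a "batch" is a run of consecutive paragraph instances sharing one paragraph type and that each instance is supplied as a (paragraph type, line count) pair; the names `TypeStats` and `gather_statistics` are hypothetical.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Dict, Iterable, Tuple


@dataclass
class TypeStats:
    """Per-paragraph-type counters (hypothetical container)."""
    instances: int = 0  # paragraph instances having this type
    lines: int = 0      # total lines across those instances
    batches: int = 0    # runs of consecutive instances of this type

    @property
    def avg_batch_length(self) -> float:
        return self.instances / self.batches

    @property
    def avg_lines(self) -> float:
        return self.lines / self.instances


def gather_statistics(
    paragraphs: Iterable[Tuple[str, int]]
) -> Dict[str, TypeStats]:
    """Count instances, lines, and batches for each paragraph type."""
    stats: Dict[str, TypeStats] = {}
    # groupby yields one group per run of consecutive equal keys, which
    # matches the assumed meaning of "batch".
    for ptype, run in groupby(paragraphs, key=lambda p: p[0]):
        entry = stats.setdefault(ptype, TypeStats())
        entry.batches += 1
        for _, line_count in run:
            entry.instances += 1
            entry.lines += line_count
    return stats


if __name__ == "__main__":
    doc = [("Head", 1), ("Body", 3), ("Body", 5), ("Head", 1), ("Body", 4)]
    for name, s in gather_statistics(doc).items():
        print(name, s.instances, s.lines, s.batches)
```

Types that appear singly and occupy few lines end up with small averages, which is the kind of signal the later mapping step can use.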

16. The computer program of claim 15, wherein instructions 
to map one paragraph type to one structural type 
comprise instructions to compare a name of the one paragraph 
type to a word connoting one of the plurality of 
structural types. 

17. The computer program of claim 15, further compris- 
ing instructions to: 

identify a plurality of subparagraph types assigned to a 
plurality of characters in the source electronic docu- 
ment; and 

map each subparagraph type to a pre-defined character 
type by examining a plurality of presentational 
attributes of the subparagraph type. 

18. A computer program for inferring structure information 
in an electronic document, comprising instructions 
operable to cause a computer to: 

identify a plurality of paragraph types in a source electronic 
document; 

map each paragraph type to one of a plurality of structural 
types, the instructions to map comprising instructions 
to: 

examine a paragraph placement for a first one of the 
paragraph types; 

compare a count of paragraph instances to which the 
first one of the paragraph types is assigned to 0 and 
a count of batches for the first one of the paragraph 
types to 0 if the paragraph placement is not a side 
placement; 

examine a first character and a last character of a name 
of the first one of the paragraph types if the para- 
graph placement is not a side placement and the 
count of paragraph instances and the count of 
batches is 0; and 

compare the name of the first one of the paragraph 
types to a word connoting a first one of the structural 
types if the first character and the last character of the 
name do not connote first one of the structural types. 

19. The computer program of claim 18, wherein instructions 
to map one paragraph type to one structural type 
comprise instructions to: 

compute a weighting factor if the count of paragraph 
instances and the count of batches are greater than 0; 
and 

determine the probability that the name connotes the one 
structural type by comparing the weighting factor to a 
pre-determined value. 

20. The computer program of claim 19, wherein instructions 
to compute the weighting factor comprise instructions 
to: 

set the weighting factor to an inverse of an average batch 
length multiplied by an average number of lines; 

multiply the weighting factor by 1.5 if the first one of the 
paragraph types is automatically numbered; 

multiply the weighting factor by 1.5 if the first one of the 
paragraph types straddles multiple columns; 

compare the weighting factor to 0.9; and 

map the first one of the paragraph types to a heading if the 
weighting factor exceeds 0.9. 
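
The weighting computation of claim 20 can be worked through numerically. The sketch below is a hedged illustration, not the patented implementation. It reads "an inverse of an average batch length multiplied by an average number of lines" as the inverse of the product, which is an assumption about an ambiguous phrase, and the names `heading_weight` and `is_heading` are hypothetical. The constants 1.5 and 0.9 are taken from the claim.

```python
BOOST = 1.5              # multiplier recited in the claim
HEADING_THRESHOLD = 0.9  # comparison value recited in the claim


def heading_weight(avg_batch_length: float,
                   avg_lines: float,
                   auto_numbered: bool = False,
                   straddles_columns: bool = False) -> float:
    """Compute the weighting factor for one paragraph type."""
    if avg_batch_length <= 0 or avg_lines <= 0:
        raise ValueError("averages must be positive")
    # Assumed reading: inverse of (average batch length * average lines).
    weight = 1.0 / (avg_batch_length * avg_lines)
    if auto_numbered:
        weight *= BOOST
    if straddles_columns:
        weight *= BOOST
    return weight


def is_heading(avg_batch_length: float,
               avg_lines: float,
               auto_numbered: bool = False,
               straddles_columns: bool = False) -> bool:
    """Map the paragraph type to a heading if the weight exceeds 0.9."""
    weight = heading_weight(avg_batch_length, avg_lines,
                            auto_numbered, straddles_columns)
    return weight > HEADING_THRESHOLD


if __name__ == "__main__":
    # A type that always appears alone on one line: weight 1.0.
    print(heading_weight(1.0, 1.0), is_heading(1.0, 1.0))
    # A body-like type in runs of five, four lines each: weight 0.05.
    print(heading_weight(5.0, 4.0), is_heading(5.0, 4.0))
```

Under this reading, a two-line type that appears singly has a weight of 0.5 and is not mapped to a heading unless it is both automatically numbered and straddles multiple columns, in which case the weight is 0.5 × 1.5 × 1.5 = 1.125.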






UNITED STATES PATENT AND TRADEMARK OFFICE 

CERTIFICATE OF CORRECTION 



PATENT NO. : 6,298,357 B1 Page 1 of 1 

DATED : October 2, 2001 

INVENTOR(S) : Jeffrey C. Young and Michael E. Wexler 

It is certified that error appears in the above-identified patent and that said Letters Patent is 
hereby corrected as shown below: 



Title page. 

Item [56] References Cited, U.S. PATENT DOCUMENTS, please replace "5,781,785 
* 4/1999" with -- 5,781,785 * 7/1998 --. 

Column 8, 

Line 64, please replace "indentation specifications;" with -- indentation specification; -- 



Signed and Sealed this 
Fourth Day of June, 2002 



Attest: 




Attesting Officer 

JAMES E. ROGAN 
Director of the United States Patent and Trademark Office 






